# Table of Contents

- Introduction
- Clusters Scheme
- RAW Tags
- Processed Tags
- Python Script
    - Requirements
    - Usage
- Dictionary Contents

# Introduction

This README explains the structure and contents of the current dataset. It was developed by the Transparency in Algorithms Group (RISE), Nicosia, Cyprus.

# Clusters Scheme

After processing the data from all taggers, that is, the image tagging APIs and the crowd-workers (US and India), we organized the tags into the super-clusters and sub-clusters shown below:

- Demographic [super]
    - Masculine
    - Feminine
    - Nonbinary
    - Age
    - Race

- Concrete [super]
    - Actions
    - Body
    - Hair
    - Clothing
    - Colors
    - Meta
    - Shape

- Abstract [super]
    - Judgement
    - Traits
    - Emotion
    - Occupation

- Inflammatory [super]

- Other [super]
    - Ambiguous
    - Inconclusive
    - Lack
    - Misc
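
The scheme above can be written down as a small lookup structure. This is only a sketch for illustration: the names `CLUSTER_SCHEME` and `super_cluster_of` are hypothetical and are not part of the dataset or of the script.

```python
# Super-clusters mapped to their sub-clusters, copied from the list above.
# CLUSTER_SCHEME and super_cluster_of are illustrative names only.
CLUSTER_SCHEME = {
    "Demographic": ["Masculine", "Feminine", "Nonbinary", "Age", "Race"],
    "Concrete": ["Actions", "Body", "Hair", "Clothing", "Colors", "Meta", "Shape"],
    "Abstract": ["Judgement", "Traits", "Emotion", "Occupation"],
    "Inflammatory": [],  # listed without sub-clusters
    "Other": ["Ambiguous", "Inconclusive", "Lack", "Misc"],
}


def super_cluster_of(sub_cluster):
    """Return the super-cluster that contains sub_cluster, or None if unknown."""
    for super_name, sub_names in CLUSTER_SCHEME.items():
        if sub_cluster in sub_names:
            return super_name
    return None
```

For example, `super_cluster_of("Hair")` returns `"Concrete"`.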


# RAW Tags

The RAW Tags directory contains an .XLS file with one sheet per tagger, holding the RAW tags that the tagger produced. In each sheet, the first column is the Image Identifier; the second, third and fourth columns are the race, gender and approximate age of the depicted person, respectively; the remaining columns are the RAW tags. The crowd-worker sheets (US and India) have two additional columns: the fifth and sixth columns hold the crowd-worker's race and gender.

Please note that the US and India crowd-worker sheets include three responses (sets of tags) per Image Identifier (depicted person).
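
As a rough illustration of this column layout, the sketch below splits one sheet row, given as a plain list of cell values, into named fields. The function name `split_raw_row` and the field names are hypothetical, and the sketch assumes that in crowd-worker sheets the RAW tags start right after the two crowd-worker columns.

```python
def split_raw_row(row, crowd_worker=False):
    """Split one sheet row (a list of cell values) into named fields and tags.

    Assumed layout: image id, race, gender, age, then the RAW tags.
    Crowd-worker rows are assumed to carry the worker's race and gender
    in the fifth and sixth columns, before the tags.
    """
    record = {
        "image_id": row[0],
        "race": row[1],
        "gender": row[2],
        "age": row[3],
    }
    tags_start = 4
    if crowd_worker:
        record["worker_race"] = row[4]
        record["worker_gender"] = row[5]
        tags_start = 6
    # Keep only non-empty cells as tags.
    cells = (str(cell).strip() for cell in row[tags_start:] if cell is not None)
    record["tags"] = [cell for cell in cells if cell]
    return record
```

The cell values used to try this out (identifiers, demographic labels, tags) would have to come from the actual sheets; none are assumed here.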

# Processed Tags

The Processed Tags directory contains an .XLS file with one sheet per tagger, holding the processed tags for that tagger. In each sheet, the first column is the Image Identifier, the second and third columns are the race and gender of the depicted person, respectively, and the remaining columns are the processed tags.

Please note that the US and India crowd-worker sheets include three to four responses (sets of tags) per Image Identifier (depicted person).

These files are the input data for the Python script that calculates the vectors. Note that, in this input file, the crowd-workers data must be structured so that each Image Identifier appears only once, together with a single set of processed tags.
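
One way to bring several crowd-worker responses into that shape is sketched below. The function name `merge_responses` is hypothetical, and the sketch assumes that the tags of all responses for an image are simply concatenated, keeping repeated tags; whether duplicates should be kept or dropped is not stated here, so treat this as one possible reading.

```python
def merge_responses(responses):
    """Collapse (image_id, tags) pairs into one tag list per image id.

    Tags from all responses for the same image are concatenated in order.
    Repeated tags are kept (an assumption of this sketch).
    """
    merged = {}
    for image_id, tags in responses:
        merged.setdefault(image_id, []).extend(tags)
    return merged
```

If duplicates should be dropped instead, the lists can be de-duplicated after merging.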

# Python Script

The count_categories.py script was created by the Transparency in Algorithms Group (RISE), Nicosia, Cyprus, for research purposes. Given an input file in CSV format that holds the Image Identifier and a set of processed labels/tags per image, the script calculates and exports the vectors of relative frequency per cluster/dimension (i.e., the proportion of tags in a description that map onto a given concept cluster).
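
The relative frequency itself can be illustrated with a few lines of Python. This is a minimal sketch of the definition above and not the code of count_categories.py; the function name `relative_frequencies` and the shape of the `clusters` argument (cluster name mapped to a set of tags) are assumptions.

```python
def relative_frequencies(tags, clusters):
    """Return, per cluster, the share of tags that belong to that cluster.

    tags: list of processed tags describing one image.
    clusters: dict mapping a cluster name to the set of tags in that cluster.
    """
    total = len(tags)
    vector = {}
    for name, vocabulary in clusters.items():
        hits = sum(1 for tag in tags if tag in vocabulary)
        # An empty description yields 0.0 for every cluster.
        vector[name] = hits / total if total else 0.0
    return vector
```

With a description of four tags, two of which belong to a cluster, that cluster's entry in the vector is 2 / 4 = 0.5.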

## Requirements

To calculate the vectors and assign tags to clusters, the script needs the DICTIONARY directory to be present, named 'dict/', in the same folder as the script.

## Usage

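Run the script with the Python interpreter:

    python count_categories.py --input '/path/to/an/example_input_file.csv' > example_vectors.csv

or, if the script is marked as executable, run it directly:

    ./count_categories.py --input '/path/to/an/example_input_file.csv' > example_vectors.csv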

# Dictionary Contents

The 'dict' directory contains a set of CSV files, each corresponding to one super-cluster or sub-cluster and listing its tags. The cluster name is given by the filename, and the tags are listed in the file as a single column.

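Please note that some of the super-clusters are not represented here, as they can be calculated as the union of their sub-cluster tags. The 21 cluster files are listed below (22 files in total, counting the corrections dictionary described later in this section).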

- actions.csv
- age.csv
- ambiguous.csv
- body.csv
- clothing.csv
- colors.csv
- emotion.csv
- feminine.csv
- hair.csv
- inconclusive.csv
- inflammatory.csv
- judgement.csv
- lack.csv
- masculine.csv
- meta.csv
- misc.csv
- nonbinary.csv
- occupation.csv
- race.csv
- shape.csv
- traits.csv
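
The sketch below shows one way to read these cluster files and to build a super-cluster as the union of its sub-clusters. It is not taken from count_categories.py: the function names `load_clusters` and `super_cluster_tags` are hypothetical, and the sketch assumes one tag per row in the first column and no header row.

```python
import csv
from pathlib import Path


def load_clusters(dict_dir, skip=("corrections_dict",)):
    """Map each cluster name (the file stem) to the set of tags in its CSV file.

    Assumes one tag per row in the first column and no header row.
    Files whose stem is listed in skip are ignored.
    """
    clusters = {}
    for path in sorted(Path(dict_dir).glob("*.csv")):
        if path.stem in skip:
            continue
        with path.open(newline="", encoding="utf-8") as handle:
            clusters[path.stem] = {
                row[0].strip()
                for row in csv.reader(handle)
                if row and row[0].strip()
            }
    return clusters


def super_cluster_tags(clusters, sub_names):
    """Return the union of the tag sets of the given sub-clusters."""
    return set().union(*(clusters[name] for name in sub_names))
```

For instance, the tags of the Concrete super-cluster could be obtained with `super_cluster_tags(clusters, ["actions", "body", "hair", "clothing", "colors", "meta", "shape"])`.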

The directory also includes a dictionary of spellcheck corrections and replacements (1 file in total). This file has two columns: the first is the original tag found in the RAW crowd-worker data, and the second is the corrected/spellchecked word that replaced it.

- corrections_dict.csv
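
A correction table of this shape can be applied to a list of tags as sketched below. The function names `load_corrections` and `apply_corrections` are hypothetical; the sketch assumes the file has no header row and that tags without an entry are left unchanged.

```python
import csv


def load_corrections(path):
    """Read a two-column CSV into a dict: original tag -> corrected tag.

    Assumes no header row; rows with fewer than two columns are skipped.
    """
    corrections = {}
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            if len(row) >= 2 and row[0].strip():
                corrections[row[0].strip()] = row[1].strip()
    return corrections


def apply_corrections(tags, corrections):
    """Replace each tag that has a correction; keep all other tags as they are."""
    return [corrections.get(tag, tag) for tag in tags]
```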

This directory is required for the Python script to run correctly. Without it, the script cannot compute the relative frequency vectors for each super/sub-cluster.
